Skip to content

[Fix-17817][Master]Add workflow timeout event and handle#18063

Open
njnu-seafish wants to merge 13 commits intoapache:devfrom
njnu-seafish:Fix-17817-2
Open

[Fix-17817][Master]Add workflow timeout event and handle#18063
njnu-seafish wants to merge 13 commits intoapache:devfrom
njnu-seafish:Fix-17817-2

Conversation

@njnu-seafish
Copy link
Contributor

Was this PR generated or assisted by AI?

NO

Purpose of the pull request

close #17817

This bug involves a critical feature and has been stagnant for several months. Since I actually resolved the task timeout and workflow timeout issue quite some time ago, I would like to finalize the process and get it closed.
image

Brief change log

Add workflow timeout event and handle

Verify this pull request

This pull request is code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(or)

Pull Request Notice

Pull Request Notice

If your pull request contains incompatible change, you should also add it to docs/docs/en/guide/upgrade/incompatible.md

@SbloodyS SbloodyS added the bug Something isn't working label Mar 13, 2026
@SbloodyS SbloodyS added this to the 3.4.2 milestone Mar 13, 2026
@AllArgsConstructor(access = AccessLevel.PRIVATE)
public class WorkflowTimeoutLifecycleEvent extends AbstractWorkflowLifecycleLifecycleEvent {

private IWorkflowExecutionRunnable workflowExecutionRunnable;

Check notice

Code scanning / CodeQL

Missing Override annotation Note

This method overrides
AbstractWorkflowLifecycleLifecycleEvent.getWorkflowExecutionRunnable
; it is advisable to add an Override annotation.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses issue #17817 by introducing a workflow-level timeout lifecycle event/handler so that configured workflow timeout alerts are actually emitted by the master engine.

Changes:

  • Add a new WorkflowTimeoutLifecycleEvent + WorkflowTimeoutLifecycleEventHandler, and wire publishing from WorkflowStartLifecycleEventHandler.
  • Add TIMEOUT to WorkflowLifecycleEventType to support routing the new event.
  • Add service/DAO plumbing for sending workflow-timeout alerts and refactor alert persistence helper naming; tighten task-timeout validation.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/alert/WorkflowAlertManager.java Adds sendWorkflowTimeoutAlert entrypoint to reach DAO alert creation.
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/handler/WorkflowTimeoutLifecycleEventHandler.java New handler to send workflow timeout alerts when the timeout event fires.
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/handler/WorkflowStartLifecycleEventHandler.java Publishes workflow timeout event at workflow start (if timeout configured).
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/event/WorkflowTimeoutLifecycleEvent.java New delay event representing workflow timeout.
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/WorkflowLifecycleEventType.java Adds TIMEOUT event type.
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/task/lifecycle/event/TaskTimeoutLifecycleEvent.java Tightens validation to require timeout minutes > 0.
dolphinscheduler-dao/src/main/java/org/apache/dolphinscheduler/dao/AlertDao.java Adds validation for workflow timeout alert inputs and consolidates timeout-alert saving helper.
dolphinscheduler-common/src/main/java/org/apache/dolphinscheduler/common/enums/AlertType.java Updates comment to reflect actual enum values.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +53 to +54
long delayTime = System.currentTimeMillis() - workflowInstance.getStartTime().getTime()
+ TimeUnit.MINUTES.toMillis(timeout);

public static WorkflowTimeoutLifecycleEvent of(IWorkflowExecutionRunnable workflowExecutionRunnable) {
final WorkflowInstance workflowInstance = workflowExecutionRunnable.getWorkflowInstance();
checkState(workflowInstance != null, "The workflow instance must be initialized before retrying.");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

private void workflowTimeoutMonitor(final IWorkflowExecutionRunnable workflowExecutionRunnable) {
final WorkflowInstance workflowInstance = workflowExecutionRunnable.getWorkflowInstance();
if (workflowInstance.getTimeout() <= 0) {
log.debug("The workflow {} timeout {} is invalided, so the timeout monitor will not be started.",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okok

Copy link
Member

@ruanwenjun ruanwenjun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review your PR first, too many of these comments have been raised repeatedly in previous PRs.

final int timeout = workflowInstance.getTimeout();
checkState(timeout > 0, "The workflow timeout: %s must > 0 minutes", timeout);

long delayTime = System.currentTimeMillis() - workflowInstance.getStartTime().getTime()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the workflow instance is rerun the startTime will not change. You should use restartTime .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the workflow instance is rerun the startTime will not change. You should use restartTime .

good

private void workflowTimeoutMonitor(final IWorkflowExecutionRunnable workflowExecutionRunnable) {
final WorkflowInstance workflowInstance = workflowExecutionRunnable.getWorkflowInstance();
if (workflowInstance.getTimeout() <= 0) {
log.debug("The workflow {} timeout {} is invalided, so the timeout monitor will not be started.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log is incorrect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log is incorrect.

fix it

Copy link
Member

@ruanwenjun ruanwenjun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add IT case.

* Pause the workflow instance
*/
PAUSE,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't change the unrelated code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't change the unrelated code.

ok

@github-actions github-actions bot added the test label Mar 20, 2026
@njnu-seafish
Copy link
Contributor Author

You should add IT case.

added

@njnu-seafish njnu-seafish requested a review from ruanwenjun March 20, 2026 08:07
postTaskCode: 1
postTaskVersion: 1
createTime: 2024-08-12 00:00:00
updateTime: 2024-08-12 00:00:00 No newline at end of file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
updateTime: 2024-08-12 00:00:00
updateTime: 2024-08-12 00:00:00

We should leave an blank line at the end of each file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should leave an blank line at the end of each file.

ok

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds workflow-timeout alerting to the Master workflow lifecycle by introducing a timeout lifecycle event + handler, wiring it into workflow start, and validating via a new master integration test.

Changes:

  • Introduce WorkflowTimeoutLifecycleEvent / WorkflowTimeoutLifecycleEventHandler and add TIMEOUT to workflow lifecycle event types.
  • Publish workflow-timeout events on workflow start and add alert sending for workflow timeouts via WorkflowAlertManagerAlertDao.
  • Add a master integration test + YAML fixture to assert a workflow-timeout alert is persisted.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/alert/WorkflowAlertManager.java Adds sendWorkflowTimeoutAlert entrypoint delegating to AlertDao.
dolphinscheduler-master/src/test/resources/it/start/workflow_with_workflow_timeout_alert.yaml New IT fixture defining a workflow with timeout + warningGroupId.
dolphinscheduler-master/src/test/java/org/apache/dolphinscheduler/server/master/integration/cases/WorkflowStartTestCase.java New IT case asserting workflow-timeout alert is created.
dolphinscheduler-master/src/test/java/org/apache/dolphinscheduler/server/master/integration/WorkflowOperator.java Passes warningGroupId into manual trigger request; extends DTO.
dolphinscheduler-master/src/test/java/org/apache/dolphinscheduler/server/master/integration/Repository.java Adds query helper for alerts by workflow instance id.
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/handler/WorkflowTimeoutLifecycleEventHandler.java New handler that sends workflow-timeout alerts.
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/handler/WorkflowStartLifecycleEventHandler.java Publishes workflow-timeout event when workflow starts and timeout is configured.
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/event/WorkflowTimeoutLifecycleEvent.java New delayed lifecycle event representing a workflow timeout.
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/workflow/lifecycle/WorkflowLifecycleEventType.java Adds TIMEOUT workflow lifecycle event type.
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/task/lifecycle/event/TaskTimeoutLifecycleEvent.java Tightens timeout precondition to > 0.
dolphinscheduler-dao/src/main/java/org/apache/dolphinscheduler/dao/AlertDao.java Validates inputs and unifies timeout alert persistence helper method.
dolphinscheduler-common/src/main/java/org/apache/dolphinscheduler/common/enums/AlertType.java Updates enum comment for alert type codes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +54 to +55
long delayTime = System.currentTimeMillis() - workflowInstance.getRestartTime().getTime()
+ TimeUnit.MINUTES.toMillis(timeout);
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WorkflowTimeoutLifecycleEvent.of assumes workflowInstance.getRestartTime() is non-null and uses it to compute delayTime. In practice restartTime can be null (there are call sites that explicitly fall back to startTime when it's missing), which would cause an NPE and prevent timeout handling. Additionally, the current delay calculation adds the elapsed time since restartTime to the configured timeout, which schedules the event later than the configured timeout if there is any delay between setting restartTime and publishing the event. Consider falling back to startTime when restartTime is null and computing the remaining delay until (restart/start time + timeout) (clamping to 0 if already exceeded).

Suggested change
long delayTime = System.currentTimeMillis() - workflowInstance.getRestartTime().getTime()
+ TimeUnit.MINUTES.toMillis(timeout);
final long baseTimeMillis =
workflowInstance.getRestartTime() != null
? workflowInstance.getRestartTime().getTime()
: (workflowInstance.getStartTime() != null
? workflowInstance.getStartTime().getTime()
: 0L);
checkState(baseTimeMillis > 0,
"The workflow instance startTime and restartTime must not both be null when creating workflow timeout event.");
final long timeoutMillis = TimeUnit.MINUTES.toMillis(timeout);
final long deadlineMillis = baseTimeMillis + timeoutMillis;
final long delayTime = Math.max(0L, deadlineMillis - System.currentTimeMillis());

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need, restartTime is not null

.dryRun(workflowTriggerDTO.getDryRun())
.taskDependType(workflowTriggerDTO.getTaskDependType())
.failureStrategy(workflowTriggerDTO.getFailureStrategy())
.warningGroupId(workflowTriggerDTO.getWarningGroupId())
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WorkflowTriggerDTO now contains warningType, but manualTriggerWorkflow doesn't propagate it into WorkflowManualTriggerRequest (only warningGroupId is set). This makes the DTO field ineffective and can lead to confusion when writing tests that intend to control warningType. Either remove warningType from the DTO or pass it through to the request builder.

Suggested change
.warningGroupId(workflowTriggerDTO.getWarningGroupId())
.warningGroupId(workflowTriggerDTO.getWarningGroupId())
.warningType(workflowTriggerDTO.getWarningType())

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@njnu-seafish njnu-seafish requested a review from SbloodyS March 25, 2026 03:10
@sonarqubecloud
Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 60%)

See analysis details on SonarQube Cloud

@njnu-seafish
Copy link
Contributor Author

@ruanwenjun @SbloodyS Whenever you have time, I’d be grateful for your review. Thanks so much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend bug Something isn't working test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] [Master] Missing workflow timeout alerts

4 participants